[SPARK-10542] [PYSPARK] fix serialize namedtuple#8707
Conversation
|
Test build #42289 has finished for PR 8707 at commit
|
6b9095b to
1d766aa
Compare
|
Test build #42293 has finished for PR 8707 at commit
|
|
Test build #42295 has finished for PR 8707 at commit
|
|
Test build #42298 has finished for PR 8707 at commit
|
|
Test build #1739 has finished for PR 8707 at commit
|
|
Test build #42304 has finished for PR 8707 at commit
|
|
Over at https://issues.apache.org/jira/browse/SPARK-10544, someone commented to mention that other types of built-in types do not seem to be pickleable in 1.5. For instance, here's the example that they gave: sc.parallelize(["the red", "Fox Runs", "FAST"]).map(str.lower).count()However, this specific example also seems to fail in 1.3.1, so I don't think that this is a regression. Just wanted to mention this discussion here to make sure you were aware of it. |
|
Do you have any intuition for why this worked prior to 1.5 without the changes implemented here? |
There was a problem hiding this comment.
Ah, interesting: presumably P and P2 are different classes but instances created from them are still comparable for equality. Do we also need to check that those instances claim to belong to the same class? It seems way less likely that users could rely on the class comparison behavior, so probably not a huge priority to look at.
There was a problem hiding this comment.
These instances should become to difference classes.
|
Actually, one point of confusion: it looks like |
|
The HACK in serializers.py is used for cPickler, not cloudpickle. |
|
Before 1.5, the old way work in CPython, but not PyPy (we don't have a unit test for it). |
|
@JoshRosen BTW, this patch introduce a special case for namedtuple, it should be safe to merge into branch-1.5. |
|
Empirically, this seems to work, so unless you think that we should investigate the root cause any further I'm fine with giving this an LGTM and merging to 1.5. Feel free to merge yourself, or I can do it. |
|
I tried to find the root cause, but it seems hard to work in all Python versions (you can see them in the older commit), finally switch to current approach. merging this into master and 1.5 branch, thanks! |
Author: Davies Liu <davies@databricks.com> Closes #8707 from davies/fix_namedtuple.
|
FYI, this works for us @ NinthDecimal. Thanks for the fix, it was a stumper! Python 2.7.6 |
|
Hi! I am not sure if this is related but is I look for this issue everything points me here basically. I'm getting When trying to create a data frame from an RDD: rdd = self.sc.textFile(self.input_file_path).map(lambda line: self.process_line(line))
schema = StructType([StructField(u'Variable', StringType(), nullable=False),
StructField(u'Time', TimestampType(), nullable=False),
StructField(u'Value', FloatType(), nullable=False)])
return sql_context.createDataFrame(rdd, schema)I am on PySpark 1.6.0 - any ideas what I'm doing wrong here? |
|
I also get this error when using namedtuples |
Author: Davies Liu <davies@databricks.com> Closes apache#8707 from davies/fix_namedtuple. (cherry picked from commit d5c0361)
No description provided.